Abstract. Computing frequent itemsets in transactional databases is
a vital but computationally expensive task. Measuring the difference
of two datasets is often done by computing their respective frequent
itemsets despite the high computational cost. This paper proposes a linear
programming-based approach to this problem and shows that there exists
a distance measure for transactional databases that relies on closed
frequent itemsets but does not require their generation.
1 Introduction

Suppose that we are given a dataset that consists of transactions (tuples), each
containing one or more items. Frequent itemsets are subsets that appear in a
large fraction of dataset tuples, where the exact fraction value is defined by the
user and is called support. Frequent pattern mining was proposed by Agrawal and
Srikant [Agrawal, Srikant 1994] for shopping basket analysis; both frequent
itemsets and association rules were introduced in that paper. Many additional
algorithms have been proposed over the years, such as FP-Growth
[Han, Pei, Yin 2000], Eclat [Zaki 2000], GenMax [Gouda, Zaki 2005] and others.
This problem has numerous applications in both theoretical and practical
knowledge discovery, but its computational complexity is another matter. It has
been shown that the easier problem of counting maximal frequent itemsets, i.e.
frequent itemsets that are not a subset of other frequent itemsets, is
#P-complete (see [Yang 2004]).

In this paper, we propose to use a linear programming approach for obtain- 
ing answers to some questions in frequent itemset mining without computing 
frequent itemsets. We introduce a proper distance measure that reflects the dif- 
ference between two datasets by comparing their closed frequent itemsets; this 
measure is also computable in polynomial time and allows us to determine whether
or not changes in a dataset over time have modified its frequent itemsets.

This paper is organized as follows. Section 2 formally defines the problem of
frequent itemset mining and outlines the questions we further address. Section
3 introduces the convex polyhedron model for transactional datasets and proves
its correctness. Section 4 explains how polyhedron models of two datasets can
be used in order to find the distance between them. Finally, Section 5 extends
the polyhedron model to the case of multiple copies of items in transactions.

2 Problem statement 

Let $D$ be a dataset of size $m$, composed of $m$ transactions
$\{t_1, \ldots, t_m\}$. Each transaction contains items that form a subset of
some finite set $V$. The size $n := |V|$ is called the cardinality of $D$. The
number of items may vary from transaction to transaction, but the items in each
transaction form a set, i.e. they cannot appear more than once. Additionally,
we have

support value $1 \le S \le m$.

A set $I$ of items (itemset) is called frequent if it appears in at least $S$
transactions as a subset. A frequent itemset $I$ is closed if no frequent
itemset properly containing $I$ has the same support. A frequent itemset $I$ is
maximal if no itemset properly containing it is frequent. Trivially, every
maximal frequent itemset is closed and frequent, and every closed frequent
itemset is frequent, but the converse is not always true.
The basic task in this setting is 

to find (all, closed, maximal) frequent itemsets in a given dataset. 

Three main approaches to this problem are Apriori [Agrawal, Srikant 1994],
FP-Growth [Han, Pei, Yin 2000] and Eclat [Zaki 2000]. This task is
computationally expensive as the number of frequent itemsets in $D$ can be
exponential in the dataset cardinality. The task remains hard even if we
require the frequent itemsets to be closed or maximal.
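
To make the three notions concrete, the following is a minimal brute-force sketch, not one of the algorithms cited above. It enumerates every non-empty itemset, so it is exponential in the number of items and only suitable for toy inputs. The function names `support` and `mine_bruteforce` are illustrative assumptions, and the input is the two-transaction bread/milk/butter dataset of Figure 1 with $S = 1$.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def mine_bruteforce(transactions, min_support):
    """Return (frequent, closed, maximal) itemsets by exhaustive search.

    Exponential in the number of items; meant only for tiny examples.
    """
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count
    # Closed: no proper superset with the same support.
    closed = {
        i for i, c in frequent.items()
        if not any(i < j and frequent[j] == c for j in frequent)
    }
    # Maximal: no proper superset that is frequent at all.
    maximal = {i for i in frequent if not any(i < j for j in frequent)}
    return set(frequent), closed, maximal


transactions = [frozenset({"bread", "milk"}), frozenset({"milk", "butter"})]
frequent, closed, maximal = mine_bruteforce(transactions, min_support=1)
```

On this input the sketch yields five frequent itemsets, of which {milk}, {bread, milk} and {milk, butter} are closed and the latter two are maximal, illustrating the chain maximal ⊆ closed ⊆ frequent.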

The focus of this paper is frequent and closed frequent itemsets of $D$. Since
enumerating all frequent itemsets or even all closed frequent itemsets is hard,
we ask other, less specific questions about these sets.

1. Given two datasets $D_1$ and $D_2$, how big is the difference between their
respective sets $CF_1$ and $CF_2$ of closed frequent itemsets?

2. Does changing the support modify the set $CF$, and if so, to what extent?



3 Polyhedron representation 

3.1 Binary dataset representation 

Let $D$ be a dataset containing transactions $\{t_1, \ldots, t_m\}$, where each
transaction $t_i$ contains a subset of items from the set
$V = \{v_1, \ldots, v_n\}$, represented by a binary matrix $M = (m_{ij})$ where
$m_{ij} = 1$ if and only if $v_j \in t_i$. The $j$-th column of the matrix $M$
is a binary vector that represents all occurrences of item $v_j$ in $D$. The
$i$-th row of $M$ is a binary vector that represents transaction $t_i$ of $D$.



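D =

    transaction #   items
    1               bread, milk
    2               milk, butter

V = {bread, milk, butter}

\[ M = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix} \]

Fig. 1. Binary matrix format of the dataset D.

\[ \bar{M} = \left(\begin{array}{ccc|cc} 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \end{array}\right) \]

Fig. 2. Inverse binary matrix of the dataset D.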

Example 1 Figure 1 shows a binary format of a dataset with two transactions
and three items and its matrix $M$, where the items are numbered as
$v_1$ = "bread", $v_2$ = "milk" and $v_3$ = "butter".
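
The construction of $M$ can be sketched in a few lines. This is an illustrative helper, with the name `binary_matrix` and the list-of-lists layout chosen here as assumptions rather than taken from the paper.

```python
def binary_matrix(transactions, items):
    """Row i, column j is 1 exactly when items[j] occurs in transactions[i]."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


items = ["bread", "milk", "butter"]                      # v_1, v_2, v_3
transactions = [{"bread", "milk"}, {"milk", "butter"}]   # t_1, t_2
M = binary_matrix(transactions, items)

# Column j lists the occurrences of item v_j across all transactions.
columns = [list(col) for col in zip(*M)]
```

Here `M` reproduces the matrix of Figure 1, and `columns[1]` (milk) is the only column consisting entirely of ones.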



We assume that

    no item is present in all transactions.                              (1)



Note that if there are items contained in all transactions of D, we simply remove 
them from the dataset. Every frequent itemset in the modified dataset is trans- 
formed into a frequent itemset in the original dataset simply by adding missing 
items. 



3.2 Inverse binary dataset representation 

We use the inverse $\bar{M}$ of the binary matrix $M$ of a dataset $D$: every
entry of $M$ is flipped, and the identity matrix $I$ is adjoined on the right.
The purpose of the identity matrix is to add "counting" elements to
transactions, so that each set of transactions (i.e. set of rows in $\bar{M}$)
is uniquely represented by the corresponding rows of $I$. Using the inverse
matrix as opposed to the original binary matrix allows us to view transaction
sets as the binary sum of the corresponding matrix rows.

Example 2 Figure 2 shows the inverse binary matrix corresponding to the binary
matrix of Figure 1 with the identity matrix $I_2$ attached.
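
A short sketch of this step follows. It assumes, as Figures 1 and 2 suggest, that "inverse" means flipping every bit of $M$, and that "binary sum" means the component-wise OR of rows; the helper names `inverse_with_identity` and `binary_sum` are hypothetical.

```python
def inverse_with_identity(M):
    """Flip every bit of M and append the m x m identity matrix on the right."""
    m = len(M)
    return [
        [1 - bit for bit in row] + [1 if i == k else 0 for k in range(m)]
        for i, row in enumerate(M)
    ]


def binary_sum(rows):
    """Component-wise OR of a collection of equal-length 0/1 rows."""
    return [int(any(bits)) for bits in zip(*rows)]


M = [[1, 1, 0],
     [0, 1, 1]]
M_bar = inverse_with_identity(M)
combined = binary_sum(M_bar)
```

In `combined`, the trailing identity part records which transactions were summed, while a zero in the item part marks an item present in every one of them (here only milk).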



3.3 Transactions to hyperplanes 

The inverse binary representation $\bar{M}$ of a dataset allows us to view every
transaction as a hyperplane in $\mathbb{R}^{m+n}$ in a natural way: every column
of the matrix corresponds to its own variable. The first $n$ variables, denoted
by $x_1, \ldots, x_n$, describe the $n$ items, and the last $m$ variables,
denoted by $y_1, \ldots, y_m$, are used for transaction counting.



\[
\bar{M} = \begin{array}{ccc|cc}
x_1 & x_2 & x_3 & y_1 & y_2 \\ \hline
0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{array}
\]

Fig. 3. Five variables corresponding to the 2 x 5 matrix of Figure 2.

Denoting the $i$-th row of $\bar{M}$ by $\mu_i$ and the variables
$(x_1, \ldots, x_n, y_1, \ldots, y_m)$ by $\mathbf{x}$, we obtain equations

\[ H_i := \mu_i \cdot (\mathbf{x} - \mathbf{x}_0) = 0 \tag{2} \]

for some choice of point $\mathbf{x}_0$. Here, $\mu_i$ is a normal vector of the
hyperplane $H_i$, which represents a single transaction in the dataset. In our
model, intersections of hyperplanes will represent transaction sets. Since we
are interested in transaction sets of size at least $S$, an additional
hyperplane is defined on the "counting variables" $y_i$:

\[ U := y_1 + \cdots + y_m = S \tag{3} \]

Example 3 Figure 3 shows the variables corresponding to the inverse matrix of
Figure 2.



3.4 Polyhedron model of dataset 

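The model Here and further, we fix the point $\mathbf{x}_0$ by setting

\[ \mathbf{x}_0 := \mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^{m+n} \tag{4} \]

and add positivity constraints for all variables

\[ 1 \le x_i, \qquad 0 \le y_j \le 1. \tag{5} \]

Observe the upper half-space of every hyperplane $H_i$,

\[ H_i^{\mathrm{upper}} := \mu_i \cdot \mathbf{x} \ge \mu_i \cdot \mathbf{x}_0, \]

and the upper half-space of the hyperplane $U$,

\[ U^{\mathrm{upper}} := y_1 + \cdots + y_m \ge S. \]

We use these half-spaces and the positivity constraints on $x_i, y_j$ to define
a convex polyhedron in $\mathbb{R}^{m+n}$, denoted by $P$:

\[
P := \begin{cases}
\mu_i \cdot \mathbf{x} \ge \mu_i \cdot \mathbf{1}, & 1 \le i \le m \\
y_1 + \cdots + y_m \ge S \\
x_i \ge 1, & i \in \{1, \ldots, n\} \\
0 \le y_j \le 1, & j \in \{1, \ldots, m\}
\end{cases}
\tag{6}
\]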

Our goal is to show that this polyhedron represents closed frequent itemsets in
the given dataset and to use this knowledge to answer various questions about
these itemsets efficiently. Since the polyhedron $P$ is defined by a system of
linear inequalities, a question posed in the form of a linear or quadratic
function can be answered efficiently using linear programming methods.



$m = 2$, $n = 3$, $S = 1$

\[
\begin{cases}
[0,0,1,1,0] \cdot [x_1,x_2,x_3,y_1,y_2]^T = x_3 + y_1 \ge [0,0,1,1,0] \cdot \mathbf{1} = 2 \\
[1,0,0,0,1] \cdot [x_1,x_2,x_3,y_1,y_2]^T = x_1 + y_2 \ge [1,0,0,0,1] \cdot \mathbf{1} = 2 \\
y_1 + y_2 \ge S = 1 \\
1 \le x_1, x_2, x_3 \\
0 \le y_1, y_2 \le 1
\end{cases}
\]

Fig. 4. System (6) for 5 variables.

Example 4 Figure 4 shows system (6) for the matrix $\bar{M}$ of Figure 2.
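
As a rough illustration of how system (6) can be assembled mechanically, the sketch below stores every constraint in the form "coefficients · point ≥ bound" and tests whether a point lies in $P$. It follows the "upper half-space" direction stated for (6); the names `build_system` and `in_polyhedron` and the list-based encoding are assumptions made for this example, not part of the paper.

```python
def build_system(M_bar, n, S):
    """Return system (6) as a list of (coefficients, lower_bound) pairs,
    each meaning  coefficients . point >= lower_bound."""
    m = len(M_bar)
    dim = n + m
    rows = []
    for mu in M_bar:                       # mu_i . x >= mu_i . 1
        rows.append((list(mu), sum(mu)))
    rows.append(([0] * n + [1] * m, S))    # y_1 + ... + y_m >= S
    for k in range(n):                     # x_k >= 1
        unit = [0] * dim
        unit[k] = 1
        rows.append((unit, 1))
    for k in range(n, dim):                # 0 <= y_j <= 1
        unit = [0] * dim
        unit[k] = 1
        rows.append((unit, 0))
        rows.append(([-c for c in unit], -1))
    return rows


def in_polyhedron(point, system, tol=1e-9):
    """True when `point` satisfies every inequality of the system."""
    return all(
        sum(c * p for c, p in zip(coeffs, point)) >= bound - tol
        for coeffs, bound in system
    )


M_bar = [[0, 0, 1, 1, 0],
         [1, 0, 0, 0, 1]]
system = build_system(M_bar, n=3, S=1)
inside = in_polyhedron([1, 1, 1, 1, 1], system)
outside = in_polyhedron([1, 1, 1, 0, 0], system)
```

The same list of inequalities could be handed to any LP solver; the membership test is only meant to make the constraints tangible.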



Properties In this section we show that there exists a correspondence between
transaction sets of sufficient size and closed frequent itemsets in the dataset.
Since every transaction set is represented in our model by an intersection of
the corresponding hyperplanes, we need the following claims.

Claim 5 Let $H_i$ and $H_j$ be two of the hyperplanes defined in (2) and let
$i \ne j$. Then $H_i \cap H_j \subseteq H$, where $H$ is the hyperplane with
normal vector $\mu_i + \mu_j$.

Proof. Every point that satisfies the equations defining $H_i$ and $H_j$ also
satisfies their direct sum

\[ (\mu_i + \mu_j) \cdot (\mathbf{x} - \mathbf{x}_0) = 0, \]

which is the equation of $H$. □

Claim 6 If $\mathbf{x}_0 = \mathbf{1}$, then $H_{ij} = H_i \cap H_j$ is the set
of points $p = (p_1, \ldots, p_n, q_1, \ldots, q_m)$ where

\[
p_k = \begin{cases}
1, & (\mu_i \vee \mu_j)[k] = 1 \\
\text{any}, & \text{otherwise}
\end{cases}
\]

and $q_1, \ldots, q_m$ assume values in $(-\infty, +\infty)$.

The normal vector of $H_{ij}$ is $\mu_i \vee \mu_j$.

Proof. The equations defining $H_i$ and $H_j$ for $\mathbf{x}_0 = \mathbf{1}$
can be satisfied as equalities only if $p_k = 1$ whenever
$(\mu_i \vee \mu_j)[k] = 1$, as $p_k \ge 1$ for all $k$ by (5). If
$\mu_i[k] = \mu_j[k] = 0$, any value of $p_k$ in the range $[1, \infty)$
satisfies both equations. Thus, the point $p$ satisfies the equation

\[ H := (\mu_i \vee \mu_j) \cdot \mathbf{x} = (\mu_i \vee \mu_j) \cdot \mathbf{1} \tag{7} \]

of a hyperplane $H$ and we have $H_{ij} \subseteq H$.

The next step is to show that $H$ is contained in $H_{ij}$.
Let $p = (p_1, \ldots, p_n, q_1, \ldots, q_m) \in H$. Since the values of
$q_1, \ldots, q_m$ are arbitrary for $H_{ij}$, we only need to observe the
values of $p_1, \ldots, p_n$. Notice that, for $p_k \ge 1$ and
$(\mu_i \vee \mu_j)[k] = 1$, we have
$(\mu_i \vee \mu_j)[k] \, p_k = (\mu_i \vee \mu_j)[k]$ if and only if
$p_k = 1$. If, however, $(\mu_i \vee \mu_j)[k] = 0$, any value of $p_k$
satisfies the equation of $H$. Then in case $\mu_i[k] = 1$ or $\mu_j[k] = 1$,
we have $p_k = 1$ and thus $p$ satisfies the equations of $H_i$ and $H_j$. □
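
The role of $\mu_i \vee \mu_j$ can be checked on the running example. The sketch below is only an illustration under the assumption that $\vee$ is the component-wise OR; the helper names are hypothetical. Because each $\mu_i$ starts with the bit-flipped transaction row, the item coordinates left free by the OR are exactly the items shared by both transactions.

```python
def row_or(mu_i, mu_j):
    """Component-wise OR of two 0/1 rows of the inverse matrix."""
    return [a | b for a, b in zip(mu_i, mu_j)]


def fixed_and_free_items(mu_i, mu_j, items):
    """Split items into those whose coordinate is fixed to 1 and those left free."""
    joined = row_or(mu_i, mu_j)[:len(items)]   # keep the item part only
    fixed = [item for item, bit in zip(items, joined) if bit == 1]
    free = [item for item, bit in zip(items, joined) if bit == 0]
    return fixed, free


items = ["bread", "milk", "butter"]
mu_1 = [0, 0, 1, 1, 0]   # row of transaction {bread, milk}
mu_2 = [1, 0, 0, 0, 1]   # row of transaction {milk, butter}
fixed, free = fixed_and_free_items(mu_1, mu_2, items)
```

For the two transactions of Figure 1 the only free coordinate is milk, which is their intersection.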



Claim 7 Vertices of $P$ have integer coordinates.

Proof. Let $p = (p_1, \ldots, p_n, q_1, \ldots, q_m)$ be a vertex of $P$. Note
that $p$ is a vertex if and only if it is not a convex combination of other
points in $P$.

Assume first that $p_i$ is not integer for some $1 \le i \le n$. For $p$ to be
a vertex of $P$ it has to lie on the intersection of at least two hyperplanes
defining $P$. If these hyperplanes are defined by positivity constraints only,
each $p_i = 1$ and $q_i \in \{0, 1\}$; the claim follows trivially. Otherwise,
$p$ lies on some hyperplane $H_j$. Then the definition of $H_j$ implies that

\[ p_1 + \cdots + p_n + q_1 + \cdots + q_m = \mu_j \cdot \mathbf{1} \in \mathbb{Z}. \]

Note that equation (3) implies that $q_1 + \cdots + q_m \in \mathbb{Z}$ and thus
$p_1 + \cdots + p_n \in \mathbb{Z}$. Then there exists $p_k$, w.l.o.g. $i < k$,
such that $p_k$ is not integer. Then the points

\[ q = (p_1, \ldots, p_i + \varepsilon, \ldots, p_k - \varepsilon, \ldots, p_n, q_1, \ldots, q_m) \]

and

\[ s = (p_1, \ldots, p_i - \varepsilon, \ldots, p_k + \varepsilon, \ldots, p_n, q_1, \ldots, q_m) \]

also lie in $P$ while $p = \frac{1}{2}(q + s)$, a contradiction to $p$ being a
vertex of $P$.

If $q_j \notin \mathbb{Z}$ for some $j$, the equation
$y_1 + \cdots + y_m = S$ of $U$ implies that there exists $0 < q_i < 1$,
w.l.o.g. $i < j$, that is also non-integer, since $S$ is integer. Then the
points

\[ q = (p_1, \ldots, p_n, q_1, \ldots, q_i + \varepsilon, \ldots, q_j - \varepsilon, \ldots, q_m) \]

and

\[ s = (p_1, \ldots, p_n, q_1, \ldots, q_i - \varepsilon, \ldots, q_j + \varepsilon, \ldots, q_m) \]

also lie in $P$ while $p = \frac{1}{2}(q + s)$, a contradiction to $p$ being a
vertex of $P$. □

Next, we show that the faces of P represent closed frequent itemsets CF of 
D. 

Claim 8 There exists a face of $P$ corresponding to every closed frequent
itemset of $D$, and every face of $P$ corresponds to a closed itemset in $D$.

Proof. It follows from Claim 5 that faces of $P$ represent itemsets contained in
subsets of transactions of $D$. Indeed, if $f$ is a face of $P$ that w.l.o.g.
satisfies the equations of $H_1, \ldots, H_k$ as equalities, then $f$ lies in
$H_1 \cap \cdots \cap H_k$ and satisfies the $k$ equalities
$\mu_i \cdot (\mathbf{x} - \mathbf{x}_0) = 0$. In this case, every point of
$H_1 \cap \cdots \cap H_k$ has all coordinates determined by the transaction
subset $\{t_1, \ldots, t_k\}$ fixed (with equality) to 1. Equation (3) ensures
that $k \ge S$. Then a normal vector $\mathbf{f}$ of the face $f$ has
$\mathbf{f}[j] = 1$, $j \in [1, n]$, whenever there exists $i$ such that
$\mu_i[j] = 1$. The frequent itemset contained in this transaction subset is
closed, since any itemset properly containing it is contained in a strictly
smaller set of transactions and thus has smaller support.

Let now $f$ be a face of $P$. Then $f$ satisfies some of the equations in (6)
as equalities and the others as strict inequalities. If the equalities include
one or more equations of $H_i$, then the subset of transactions expressed by
these hyperplanes has size at least $S$ due to (3) and therefore corresponds to
the inclusion-maximal itemset contained in the transaction set. Otherwise, none
of the equations (2) is satisfied as equality. Since every item is absent from
some transaction, we have $x_i > 1$ for all $i$ on $f$. In this case, $f$
represents the empty itemset, whose support is undetermined. More than one face
can correspond to the empty itemset because of the different possible values
(0 or 1) of the $y_i$'s. □

Linear transformation To solve the problem of several faces of $P$ representing
the empty itemset, we apply a linear transformation
$T : \mathbb{R}^{m+n} \to \mathbb{R}^{n+1}$ to the polyhedron $P$, where

\[ T(y_1 + \cdots + y_m) = z. \tag{8} \]

A linear transformation of a convex polyhedron preserves linearity and convexity
(see, e.g. [Barvinok, Pommersheim 1996]); applying the transformation $T$ to
$P$ gives us a convex polyhedron

\[ L = T(P). \tag{9} \]

Corollary 9 There exists a function from the faces of $L$ to the closed frequent
itemsets $CF$ of $D$. □
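
A small sketch of the map in (8), read here as "keep the item coordinates and replace the counting coordinates by their sum $z$"; this reading and the function name `apply_T` are assumptions made for illustration.

```python
def apply_T(point, n):
    """Map (x_1..x_n, y_1..y_m) to (x_1..x_n, z) with z = y_1 + ... + y_m."""
    x, y = point[:n], point[n:]
    return list(x) + [sum(y)]


# Two points that differ only in which counting variable is switched on.
image_a = apply_T([2, 2, 2, 0, 1], n=3)
image_b = apply_T([2, 2, 2, 1, 0], n=3)
```

Both points have the same image, which is how the transformation merges faces that differ only in the values of the $y_j$.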

4 Dataset comparison 

Suppose we are given two transactional datasets $D_1$ and $D_2$ over items
$V = \{v_1, \ldots, v_n\}$ (both datasets may also represent the state of a
single dataset $D$ at different points in time). We wish to compare the sets of
closed frequent itemsets of $D_1$ and $D_2$, denoted by $CF_1$ and $CF_2$, for
given support values $S_1$ and $S_2$, where $S_1$ and $S_2$ may differ.

A straightforward method of comparing $CF_1$ and $CF_2$ is to compute them
and then calculate the distance between them according to a chosen metric,
such as Hamming distance, or Manhattan or Euclidean distance if the values are
numerical. However, computing the sets $CF_1$ and $CF_2$ is an expensive task,
as the sizes of these sets may very well be exponential in $m$.

We propose another approach to solve this problem. First, we adjust the
smaller of these datasets (say, $D_1$) so that $|D_1| = |D_2|$ by adding
$|D_2| - |D_1|$ transactions containing all the items and increasing $S_1$
accordingly by setting $S_1 := S_1 + |D_2| - |D_1|$. Any itemset that was
frequent in $D_1$ before the adjustment remains frequent after the adjustment
as well. Now, let us observe two polyhedra $L_1$ and $L_2$ defined in (9) for
datasets $D_1$ and $D_2$ respectively. In order to compensate for the
difference in support values, we apply a simple linear transformation
$T_S : \mathbb{R}^n \to \mathbb{R}^n$ defined by

[The formula defining $T_S$ did not survive in the source.]

where $z$ is the variable defined in (8).

Fig. 5. Minkowski sum $P \oplus Q$ of two convex polygons.
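
The size adjustment described above (padding the smaller dataset with transactions that contain every item and raising its support threshold by the same amount) can be sketched as follows. The function names `equalize` and `support` are illustrative assumptions; the transformation $T_S$ itself is not covered.

```python
def equalize(d1, s1, d2, s2, all_items):
    """Pad the smaller dataset with full transactions and raise its support."""
    d1, d2 = list(d1), list(d2)
    gap = len(d2) - len(d1)
    full = frozenset(all_items)
    if gap > 0:
        d1 += [full] * gap
        s1 += gap
    elif gap < 0:
        d2 += [full] * (-gap)
        s2 += -gap
    return d1, s1, d2, s2


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


items = ["bread", "milk", "butter"]
d1 = [frozenset({"bread", "milk"})]
d2 = [frozenset({"bread", "milk"}), frozenset({"milk", "butter"}),
      frozenset({"butter"})]
p1, s1, p2, s2 = equalize(d1, 1, d2, 2, items)

milk = frozenset({"milk"})
butter = frozenset({"butter"})
```

Every added transaction contains every itemset, so each support count grows by exactly the amount added to the threshold. In the example, {milk} is frequent in $D_1$ both before and after padding, and {butter} is frequent in neither.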

The question of distance between datasets is easily translated to the question
of polyhedron distance and/or the volume of their difference. Since computing
the volume of a polytope is NP-hard (see e.g. [Dyer, Frieze 1988]), we employ
another kind of distance measure called polyhedra penetration depth.

Definition 10 Let $P$ and $Q$ be two intersecting polyhedra. The penetration
depth of $P$ and $Q$, denoted by $PD(P, Q)$, is the minimum translational
distance (a translation moves every point a constant distance in a specified
direction) that one of the polyhedra must undergo to render them disjoint.

In other words, the penetration depth $PD(P, Q)$ is the minimum distance from
the difference vector between the origins of $P$ and $Q$ to the surface of the
Minkowski sum (denoted by $\oplus$) of $P$ and $-Q$. An example of a Minkowski
sum is given in Figure 5. The computational complexity of computing the
Minkowski sum is $O(m^2)$ when $P$ and $Q$ are convex polyhedra, i.e.
polynomial in $m$ (see [O'Rourke 1994] for a survey of algorithms; the fastest
algorithms are described in [Kaul, O'Connor, Srinivasan 1991] and
[Sharir 1987]).
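
To illustrate the definition, here is a minimal two-dimensional sketch. It is not the algorithm of the cited papers: it forms $P \oplus (-Q)$ naively from all vertex differences and takes a convex hull. It assumes both polygons are convex, have non-empty interior, and are given in the same coordinate frame, so that the "difference vector between the origins" is the zero vector. All function names are hypothetical.

```python
from math import hypot


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def minkowski_difference(P, Q):
    """Vertices of P + (-Q) for convex polygons given by their vertex lists."""
    return convex_hull([(px - qx, py - qy) for px, py in P for qx, qy in Q])


def penetration_depth(P, Q):
    """Distance from the origin to the boundary of P + (-Q); 0.0 if P, Q are disjoint."""
    hull = minkowski_difference(P, Q)
    depth = float("inf")
    for (ax, ay), (bx, by) in zip(hull, hull[1:] + hull[:1]):
        ex, ey = bx - ax, by - ay
        # Signed distance of the origin from edge a->b (positive = inside, CCW order).
        signed = (ex * (0 - ay) - ey * (0 - ax)) / hypot(ex, ey)
        if signed < 0:
            return 0.0  # origin lies outside: the polygons do not overlap
        depth = min(depth, signed)
    return depth


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)]
far_away = [(5, 0), (6, 0), (6, 1), (5, 1)]
overlap_depth = penetration_depth(square, shifted)
disjoint_depth = penetration_depth(square, far_away)
```

The two unit squares overlap in a strip of width 0.5, and moving one of them by 0.5 along the x-axis separates them, which is the value the sketch returns. The naive construction is quadratic in the number of vertices even in the plane.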

With this approach, we can answer the following questions about $D_1$ and
$D_2$ (and obtain the answers in time polynomial in $n, m$).

1. The distance between $D_1$ and $D_2$ can be computed as
$PD(T_S(L_1), T_S(L_2))$.

2. The common part of $CF_1$ and $CF_2$ is the polyhedron
$T_S(L_1) \cap T_S(L_2)$. While computing the volume of
$T_S(L_1) \cap T_S(L_2)$ is hard, it is easy to determine whether or not this
polyhedron is empty by minimizing any suitable objective function, for
instance $f = y_1$. If the polyhedron is empty, the linear program is
infeasible. Since this is an LP task, it can be done in polynomial time.

3. In the special case where $D_1 = D_2$ but w.l.o.g. $S_1 < S_2$, the
polyhedron $L_1 \oplus (-L_2)$ is convex. This polyhedron is empty if the
change in support value does not change the frequent itemsets.

This approach to computing the distance between sets of frequent itemsets can
be employed in inverse frequent itemset mining. This task is defined as the
problem of generating synthetic datasets from frequent itemsets that have been
computed already. While in general this problem was shown to be NP-complete
(see [Mielikainen 2003], [Calders 2004], [Wang, Wu 2005]), the problem of
finding whether or not an already generated dataset has the same set of
frequent itemsets is much easier. Indeed, given a dataset $D$ and frequent
itemsets $F$, we can construct the polyhedron $L$ of frequent itemsets of $D$
and compute the Minkowski sum $L \oplus (-F)$ in polynomial time. If this sum
is empty, $D$'s closed frequent itemsets are entirely described by the set
$CF$. Otherwise, $PD(L, CF)$ is the distance between them.

D' =

    transaction #   items
    1               bread, bread, milk, milk
    2               milk, butter, butter, butter

items V = {bread_1, bread_2, milk_1, milk_2, butter_1, butter_2, butter_3}

    transaction #   items
    1               bread_1, bread_2, milk_1, milk_2
    2               milk_1, butter_1, butter_2, butter_3

\[ M = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{pmatrix} \]

\[ \bar{M} = \left(\begin{array}{ccccccc|cc} 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \end{array}\right) \]

Fig. 6. Binary representation of a generalized dataset.

5 Generalized datasets 



In this section, we take a look at a generalized problem of analyzing frequent 
itemsets in a dataset D where an item may appear more than once in a trans- 
action. 

The exact problem setting is as follows. We have a dataset
$D = \{t_1, \ldots, t_m\}$ where each transaction $t_i$ is a multiset of items
from $V = \{v_1, \ldots, v_n\}$. For a given natural $1 \le S \le m$, a multiset
$I$ of items is frequent if there are at least $S$ instances of it in $D$. Note
that $D$ cannot be represented as a binary matrix directly. Instead, we apply
the following transformation:

- If transaction $t_i$ contains $k$ copies of item $v_j$, we replace these
  copies with synthetic items $v_{j,1}, \ldots, v_{j,k}$.

In the end, we obtain a non-generalized dataset that can be represented by a
binary matrix. Figure 6 shows how this change is implemented on a dataset of
size 2 and cardinality 3.
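
The replacement of repeated items by synthetic ones can be sketched as below. The pair `(item, copy_index)` used to name a synthetic item, and the helper names, are assumptions made for this example; the input is the dataset of Figure 6.

```python
from collections import Counter


def expand_multisets(transactions):
    """Replace k copies of item v in a transaction by (v, 1), ..., (v, k)."""
    return [
        {(item, copy) for item, count in Counter(t).items()
         for copy in range(1, count + 1)}
        for t in transactions
    ]


def binary_matrix(expanded, columns):
    """Row i, column j is 1 exactly when columns[j] occurs in expanded[i]."""
    return [[1 if col in t else 0 for col in columns] for t in expanded]


D = [["bread", "bread", "milk", "milk"],
     ["milk", "butter", "butter", "butter"]]
expanded = expand_multisets(D)
columns = [("bread", 1), ("bread", 2), ("milk", 1), ("milk", 2),
           ("butter", 1), ("butter", 2), ("butter", 3)]
M = binary_matrix(expanded, columns)
```

With the synthetic items ordered as in Figure 6, `M` has rows 1111000 and 0010111.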

After this transformation, we can build system (6) of linear inequalities for
the modified inverse matrix. We obtain a convex polyhedron $P$ that represents
our generalized dataset $D$.

Using the results of Section 3 we conclude that

1. $P$ is a convex polyhedron whose faces represent closed frequent itemsets of
$D$.

2. Applying the linear transformation $T(\sum_i y_i) = z$ to $P$ results in a
convex polyhedron $L(P)$ whose faces represent closed frequent itemsets of $D$.

Then the following results hold for our model.

1. Enumerating the vertices of $P$ is #P-hard (see
[Barvinok, Pommersheim 1996]).

2. Computing the volume of $P$ is an NP-hard problem.

3. The problem of classifying datasets up to equivalence of their frequent
itemsets is wild.

4. The penetration depth of two polyhedra $L_1$ and $L_2$ can be computed in
polynomial time, and it is a measure of distance between the sets of frequent
itemsets of two datasets, or a measure of change of the same dataset over
time or over different support values.



6 Conclusions 

This paper proposes a linear programming approach in order to compare closed 
frequent itemsets of two datasets; such an approach can also be applied to the 
case of a single dataset that changes over time. Systems of linear inequalities are 
used to express closed frequent itemsets as convex polyhedra; polyhedra pene- 
tration depth measure is then used as a distance measure for the two datasets. 